ABSTRACT


In online educational systems we can easily collect and an- 
alyze extensive data about student learning. Current prac- 
tice, however, focuses only on some aspects of these data, 
particularly on correctness of students’ answers. When a
student answers incorrectly, the submitted wrong answer can
give us valuable information. We provide an overview of 
possible applications of wrong answers and analyze wrong 
answers from three different educational systems (geogra- 
phy, anatomy, basic arithmetic). Using this cross-system 
comparison we illustrate some common properties of wrong 
answers. We also propose techniques for processing of wrong 
answers and their visualization, particularly an approach to 
item clustering based on community detection in a confusion 
graph. 


1. INTRODUCTION 


A key advantage of computerized educational systems is 
their potential for personalization. By analyzing students’ 
answers we can estimate their knowledge using student mod- 
eling techniques and adapt the behaviour of a system to the 
needs of individual students. Student models [6] typically 
utilize only information about correctness of answers. On- 
line systems, however, collect (or can easily collect) much 
richer information, e.g., timing information [18] and specific 
details about answers and individual steps. In this work we 
focus on analysis of wrong answers. 


Wrong or incomplete answers from online educational sys- 
tems have been studied previously, but mostly just as a 
supplementary analysis to other research interests. For ex- 
ample, analysis of programming assignments in MOOCs [9, 
14] shows that the distribution of wrong answers is highly 
skewed, containing few very common wrong answers. This 
research does not, however, focus on analysis of wrong an- 
swers, but rather on finding similar or equivalent solutions 
and their visualizations (as there are many ways how to write 
the same program) [7]. 


The observation that distribution of wrong answers is highly
skewed holds not only for programming assignments, but 
also for other domains. For example, common wrong answers
have been used for student modeling in mathematics [29],
but that work uses only the information whether a wrong
answer is common or not; it does not utilize the actual
values of wrong answers. Specific student answers were also
modeled [8], but the authors present only the overall accuracy
of the proposed model without discussing specific mistakes.


Data analysis techniques have been used for analysis of
mathematical errors with the goal of classification (explanation)
of answers [13, 24]. The results show that it is possible
to classify most wrong answers into one of a few categories.
Other data-driven techniques in educational data mining 
have focused mainly on programming assignments [10, 21]. 
Rather than “wrong answers” they utilize “incomplete so- 
lutions” and use them for automatic generation of hints 
(changes towards a correct solution). 


In the wider context, wrong answers are related to miscon- 
ceptions, which are intensively studied in pedagogical lit- 
erature, e.g., misconceptions in mathematics [26] or chem- 
istry [22]. This line of research focuses on understanding 
“buggy rules” used by students [4]. These rules are useful 
not just for educating teachers about student thinking, but 
also in development of intelligent tutoring systems. They can
also be used as a basis of erroneous examples [1, 11].
Research in this direction is typically based on expert insight
using only relatively small (and often qualitative) data and 
the focus is typically on complex skills. 


In this work we focus on automatic techniques for analysis
of large quantitative data, dealing with simple skills
(learning of declarative knowledge and simple procedures).
We describe analysis of wrong answers from three educational
systems. Although the used systems share similar
basic principles, they cover widely different domains (geography,
anatomy, basic arithmetic) and different learner populations
(from kindergarten to university students). Thanks
to the size of the used data set (millions of answers), results
provide interesting insights into properties and potential of 
wrong answers. We describe specific examples of analysis 
and propose novel techniques for analysis and visualization 
of wrong answers. A key observation is that wrong answers 
in our three domains (geography, anatomy, basic arithmetic)
share many properties and thus it should be feasible to carry 
insights and analysis techniques across domains. 


2. POTENTIAL APPLICATIONS OF WRONG ANSWERS


In this section we outline potential applications of wrong an- 
swers. The presented applications are rather general and for 
a specific application they need to be more precisely quan- 
tified. In the next section we provide such specific analysis 
for three particular domains. 


2.1 Student and Domain Modeling 

Student and skill models [6] typically utilize only binary
information about correctness of an answer (correct/incorrect).
A more thorough analysis of wrong answers may improve 
student and skill modeling in several directions. 


In modeling of cognitive skills, wrong answers may help 
to distinguish between absence of understanding and slips 
(careless errors, typos). A highly uncommon wrong answer is
more likely to be a careless error, whereas a common wrong
answer is more likely to be a genuine mistake (unless caused
by a poorly designed user interface). Wrong answers may also
be indicative of the level of knowledge and strategies that 
students are using. Consider for example a multiplication 
5 x 5: a student A answers quickly 30, whereas a student 
B answers 24 after a long time. This may indicate that the 
student A retrieved the answer (incorrectly) from declarative 
memory, whereas the student B made an error in a procedu- 
ral strategy. Wrong answers can thus be useful for modeling 
cognitive processes of learners [27]. Moreover, they may be 
useful also for modeling affect and motivation [29]. Irrele- 
vant, highly uncommon wrong answers (particularly when 
repeated and quickly delivered) are probably an indication of
disengagement rather than lack of knowledge. 


Wrong answers may also be useful for domain modeling.
Common wrong answers may indicate relations between topics
and thus may be used for automatic detection of knowledge
components. Even though these may be misconceived relations,
when they are common, they may be useful for student
modeling. Relations between items based on wrong answers 
may also be taken into account in the design of the user in- 
terface or in the item selection algorithm. Wrong answers 
can also be used for student clustering — different groups of 
students make different types of mistakes and need different 
treatment from the educational system (e.g., students with 
dyslexia or dyscalculia). 


2.2 Construction of Items and Hints 

A basic observation about wrong answers, which seems to 
be valid in many different domains, is that the distribution 
of wrong answers is often highly skewed, i.e., some mistakes 
are much more common than others. This feature of wrong
answers is potentially very useful for construction of ques- 
tions and hints (both manual and automatic). 


Common wrong answers may highlight student misconcep- 
tions and thus provide inspiration for new items (problems). 
In the case of items with simple structure, wrong answers 
may even be used automatically, e.g., as competitive dis- 
tractors in multiple choice questions [16]. Previous work [1, 
11] explored the possibility of using erroneous examples in 
education. Common wrong answers provide useful material 
for creation of such examples. 


Wrong answers may also be useful for development of hints, 
feedback to students, and other scaffolding aids. If the hints 
are developed manually by experts, wrong answers provide 
a good way to prioritize the expensive work of an expert. Due
to the skewed distribution of wrong answers it may be pos- 
sible to quickly provide answer-specific feedback to most an- 
swers even in open environments [9]. It is also possible to 
generate hints automatically based on actions of other stu- 
dents with the same wrong answer [23]. 


2.3 Feedback for Learners, Teachers, and Tool Developers

Analysis of wrong answers can also bring more pragmatic 
advantages. A useful feature of personalized educational 
systems is an overview of mistakes made by a learner or a 
class. Such an overview can serve for example as a base 
for a review session. Teachers may use such an overview to
quickly detect common problems of their students and thus 
focus on problematic parts in classroom time or in personal 
consultations. 


For tool developers common wrong answers may be useful 
as an indicator of problems with a user interface. For exam- 
ple, in a prototype of one of the systems used in this study 
there was a common wrong answer “1” in cases where the 
answer should have been “10”. This turned out to be a user 
interface issue — the system was expecting a single click on a 
“10” button, whereas users were trying to click buttons “1” 
and “0”. 


For these types of applications, basic analysis of wrong an- 
swers should be easily accessible in educational systems for 
both teachers and system developers. Since there can be a 
large number of mistakes, it is important to make the listing 
of mistakes easy to navigate. To achieve this goal, we need 
to understand common features of wrong answers. 


3. ANALYSIS OF WRONG ANSWERS 


After the general discussion of properties and possible ap- 
plications of wrong answers, we turn to analysis of specific 
data. 


3.1 Used Systems and Data 


The used systems cover three different domains (geography, 
anatomy, basic arithmetic) and are used by very different 
learners, but they have been developed by the same research 
group and share the basic principles. All of them focus on 
adaptive practice of declarative knowledge or simple proce- 
dures. Systems estimate learners’ knowledge and based on 
these estimates they adaptively select questions of suitable 
difficulty. They use a target success rate (e.g., 75%) and 
adaptively select questions in such a way that the learners’
achieved performance is close to this target. 
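
The idea of a target success rate can be illustrated with a minimal sketch. It assumes that a student model has already produced a predicted probability of a correct answer for each candidate item, and it simply picks the item whose prediction is closest to the target. The function name `select_question`, the dictionary layout, and the example values are hypothetical; the actual selection in the described systems is specified in the cited papers and may take further criteria into account.

```python
def select_question(predicted_success, target=0.75):
    """Pick the item whose predicted success probability is closest to the target.

    predicted_success: dict mapping item name -> predicted probability of a
    correct answer (assumed to come from some student model).
    """
    if not predicted_success:
        raise ValueError("no candidate items")
    return min(predicted_success,
               key=lambda item: abs(predicted_success[item] - target))


# Hypothetical predictions for one learner.
candidates = {"Rwanda": 0.35, "Egypt": 0.95, "Kenya": 0.70}
chosen = select_question(candidates)
print(chosen)
```

With the illustrative values above, the item with prediction 0.70 is selected because it lies nearest to the 75% target.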


The used questions are either multiple-choice questions or 
“open questions” — either a free text answer or selection of 
any item from a provided context (e.g., “select Rwanda on 
the map of all African states”). For the analysis we use only 
answers to open questions, since the used multiple-choice 
questions have adaptively chosen distractors and this fea- 
ture makes analysis difficult (due to the presence of feedback 
loops [19]). 


The first system is Outline Maps (outlinemaps.org) for
practice of geography facts (e.g., names and locations of 
countries, cities, mountains). Details of the behaviour of 
the system are described in [15, 16]. The used data set con- 
tains more than 10 million answers (with more than 1 million 
wrong answers) and is publicly available [17]. This data set 
is the largest of the three used data sets and it is at the core 
of the presented analysis. The application is currently used
by hundreds of learners per day. The majority of learners are
from the Czech Republic, since the interface was originally
available only in Czech. The geographical origin and language
of students clearly influence the interpretation of the results
presented below. However, our main point is not the interpretation
of particular results, but rather an illustration of the different
insights that can be gained by the analysis of the data.


The second system is Practice Anatomy for practicing hu- 
man anatomy (practiceanatomy.com). The main target 
audience of the system consists of junior medical students 
preparing for their anatomy exams. Currently, the system 
offers practice of more than 1800 items organized into 14 
organ systems and 9 body parts. Learners can practice a 
selected organ system or a body part, or specify a more ad- 
vanced practice filter as an intersection of a set of organ 
systems and a set of body parts. The system is available 
in Czech (with Latin terminology) and English. Most users 
are from the Czech Republic. The analyzed data set contains
over 380000 answers. 


The third system is MatMat (matmat.cz) for practice of 
basic arithmetic; its functionality is similar to, for example,
Math Garden [24]. The system contains examples divided
into 5 high level concepts (counting, addition, subtraction,
multiplication, division). Each of these concepts contains
around 50 to 700 items, over 2000 items in total. The system
behaviour and the used student modeling approach are de- 
scribed in [28]. The analyzed data set contains over 180000 
answers. 


Student knowledge and mistakes in the used domains have 
been analyzed before, e.g., recall and mistakes in knowledge 
of US states [20] or knowledge of Europe by Turkish students [25].
These works focused on the difficulty of recall of
individual countries and on factors which influence this
difficulty (e.g., borders); they did not analyze wrong answers.
Moreover, we use a data set that is orders of magnitude
larger than those used in previous research on geography 
knowledge. The domain of basic arithmetic has been stud- 
ied intensively before, even with the focus on mistakes. A 
well-known example is the repair theory [4] with a case study
for subtraction problems. Multiplication in particular has
been studied in detail, e.g., description of effects influencing
difficulty (size effect, five effect, tie effect), a connectionist
model of retrieval [27], classification of errors [5, 24]. Our
contribution in this domain is mainly in aligning the results 
with analysis from different domains (learning declarative 
knowledge in geography and anatomy). 


3.2 Common Wrong Answers 

Generally the distribution of wrong answers is highly skewed:
most wrong answers are made up of just a few items.
Analysis of commonly confused countries shows that the
most important factors are whether the countries have a common
border, whether they have similar size (an important factor
particularly if they have a common border), and whether their
names start with the same first letter (an important factor
particularly if they do not have a common border). There are
differences between the skewness of the distribution of wrong
answers for individual items. For some countries there are
a few very typical mistakes: for Bulgaria more than 40% of
wrong answers are Romania, and for Finland the two most common
wrong answers (Sweden and Norway) comprise nearly
three quarters of wrong answers. Some countries, however,
have a much more even distribution of wrong answers, e.g., for
Switzerland or Croatia the most common mistake comprises
only 10% of wrong answers.


Proceedings of the 9th International Conference on Educational Data Mining


The context of questions is also important. In the used sys- 
tem countries can be practiced either in the context of a 
single continent or of the whole world. In most cases the 
mistakes on the world map are within the same continent 
(i.e., the wrong answers on the world map are very similar 
to wrong answers within the continent map). There is, however,
a nontrivial number of exceptions, for example: countries
with confusingly similar names, e.g., Guinea, Guyana, and
Papua New Guinea, which are on three different continents;
countries close to continent borders, e.g., Turkey is confused
with European countries, and Arab countries in Africa and Asia
are confused with each other; and islands, which are confused
with one another, e.g., Madagascar is not confused with other
African countries, but with other islands. These examples
illustrate the importance of proper practice context for some
items, e.g., it is not very useful to practice Madagascar on
the map of Africa; Madagascar should be practiced mainly
on the map of the whole world. Such results can have direct
consequences for the design of the behaviour of educational 
systems. 


The data from the MatMat application contain similar pat- 
terns — the distribution of wrong answers is skewed, but 
the skewness of the distribution differs among items. Some 
items have a very typical wrong answer (e.g., 1 x 1 = 2, 4 x 9 =
32); for other items wrong answers are more uniformly distributed
(e.g., 6 x 8 with answers 42, 54, 56, 64, 78). Previous
work [24] has analyzed classification of errors in basic arithmetic
(particularly in multiplication), using categories like
near miss (+1), typo, operation error, or operand related
error. In agreement with previous research [13, 24], a large
part of wrong answers fits into one of these categories, and
the dominant categories are as expected — for counting and 
addition the dominant error type is “near miss”, whereas 
for multiplication a common error is operand related, e.g., 
4 x 9 = 32 (which is 4 x 8). There are, however, interesting 
differences between items of the same type. For division the 
typical mistake is “near miss” (+1). For division by 1 and 
10, however, the typical mistakes are rather answers 1 and 
10; for items of the type N/N common wrong answers are N 
or 0. For small operands (e.g., 4/2) operation errors (multi- 
plication instead of division) sometimes occur, whereas this 
does not happen for larger operands (e.g., 54/6). 
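
A rule-based classification of this kind can be sketched in a few lines. The sketch below covers multiplication only and uses our own simplified definitions, which are assumptions rather than the definitions used in [24]: a near miss is an answer that differs from the correct product by 1, an operand related error is the product obtained when one operand is changed by 1, and an operation error is the result of adding or subtracting the operands. The typo category is omitted, and because the categories can overlap, the order of the checks is also an arbitrary choice.

```python
def classify_multiplication_error(a, b, answer):
    """Assign a wrong answer to a x b to a coarse error category (simplified)."""
    correct = a * b
    if answer == correct:
        return "correct"
    # Near miss: off by one from the correct result.
    if abs(answer - correct) == 1:
        return "near miss"
    # Operand related: correct product for a neighbouring operand.
    neighbours = {a * (b - 1), a * (b + 1), (a - 1) * b, (a + 1) * b}
    if answer in neighbours:
        return "operand related"
    # Operation error: a different operation applied to the same operands.
    if answer in {a + b, a - b, b - a}:
        return "operation error"
    return "other"


print(classify_multiplication_error(4, 9, 32))  # 32 equals 4 x 8
print(classify_multiplication_error(6, 8, 14))  # 14 equals 6 + 8
```

Under these definitions the answer 32 to 4 x 9 from the text is classified as operand related, since it equals 4 x 8.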


3.3 Categories of Wrong Answers 

To provide a more quantitative analysis and comparison 
across educational systems, we define a coarse classification 
of wrong answers and analyze properties of individual categories.
We propose the following classification of wrong answers
into four categories (note that the defined categories
can be seen as “degrees of wrongness” of an answer with a
natural ordering). TopWA is the most common wrong answer
for a given item. CWA is a common wrong answer other
than the most common one (as a definition of “common” we
require that the number of occurrences is more than 5% of
all wrong answers for the given item; it must also be larger
than 1). Other is any nonempty answer that is not common.
Missing is an empty answer. Previous research [29]
used a 10% bound for the definition of common wrong answers,
but they did not treat the top wrong answer separately.
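
The four categories can be computed per item as in the following sketch. Several details are our assumptions, since the text does not fix them: empty answers are represented as `None` or an empty string, empty answers count toward the total of "all wrong answers", and ties for the most common answer are broken by order of first occurrence. The function name `build_categorizer` and the example data are hypothetical.

```python
from collections import Counter


def is_missing(answer):
    """An empty answer; note that the number 0 is a regular (nonempty) answer."""
    return answer is None or answer == ""


def build_categorizer(wrong_answers, threshold=0.05):
    """Return a function mapping a wrong answer for one item to its category.

    wrong_answers: list of all wrong answers submitted for a single item.
    """
    counts = Counter(a for a in wrong_answers if not is_missing(a))
    total = len(wrong_answers)  # assumption: empty answers are included
    top = counts.most_common(1)[0][0] if counts else None

    def categorize(answer):
        if is_missing(answer):
            return "Missing"
        if answer == top:
            return "TopWA"
        n = counts.get(answer, 0)
        # "Common": more than 5% of wrong answers and more than 1 occurrence.
        if n > 1 and n > threshold * total:
            return "CWA"
        return "Other"

    return categorize


# Hypothetical wrong answers for the item "Bulgaria".
answers = ["Romania"] * 9 + ["Serbia"] * 4 + ["Hungary"] * 2 + ["Malta"] + [""] * 4
categorize = build_categorizer(answers)
print({a: categorize(a) for a in ["Romania", "Serbia", "Hungary", "Malta", ""]})
```

Because the categories have a natural ordering, the returned labels can be mapped to ordinal values when used as input for a student model.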


Figure 1 (top) shows the distribution of answers among these
classes. Although there are some differences between the
used systems (and between specific maps in the geography
system), overall the distribution is quite balanced, i.e., the
used definitions of classes provide a reasonable partition of
wrong answers. The rest of Figure 1 shows characteristics
of student behaviour related to answers from individual cat- 
egories. Since in this work we are interested mainly in rel- 
ative comparison among types of answers (and not among 
systems), the results are normalized with respect to correct 
answers (for each system). The reported characteristics are 
computed globally. We have also analyzed more detailed results
(e.g., for specific practice contexts like European countries
or one digit multiplication); these results show similar
trends.


The results show clear trends that are very similar across
the three used systems. The median response time increases
with the wrongness of the answer, with the exception of missing
answers. The probability of leaving the system directly after
an answer is much higher for wrong answers than for correct
answers. Also within the wrong answers there is a clear
trend (the probability of leaving increases with wrongness).
Finally, the last two graphs analyze the future success of a
student: the probability of success on the next question about
the same item, and the probability of success on the next
question within the system (global). In both cases the
probability of future success decreases with the wrongness of
the current answer.


We see that there are systematic differences between different
types of wrong answers. The general nature of these
differences is rather intuitive; the main interesting aspects of
these results are the similarity of results across three different
domains and the consistently linear nature of these relationships,
i.e., we can say that the distance between TopWA
and CWA is the same as the distance between CWA and 
Other. The bottom line is that the wrongness of answers 
can be treated as an interval variable and it may be useful 
to utilize it as such for student modeling (for modeling both 
knowledge and affect). 


3.4 Confusion Graph and Item Clustering 

So far we have analyzed wrong answers for each item sepa- 
rately. But mistakes for individual items are clearly inter- 
connected. We can analyze these interconnections with a 
“confusion graph” (a similar analysis has been done previ- 
ously for the domain of statistics [12], but for much smaller 
data). In a confusion graph, nodes are individual items and
edges correspond to wrong answers. We consider a weighted
graph where the weight of an edge (u, v) is given by the frequency
of a particular wrong answer v among all wrong answers on
an item u. This definition leads to a directed graph; to obtain
an undirected graph we compute the weight of an undirected
edge by averaging the weights of the corresponding
directed edges.
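
The construction of the confusion graph can be sketched as follows. The input format (a list of pairs of asked item and wrongly answered item), the function name `confusion_graph`, and the example log are hypothetical. We also assume that a missing reverse edge contributes weight 0 to the average, which the definition above does not state explicitly.

```python
from collections import Counter, defaultdict


def confusion_graph(wrong_answer_log):
    """Build directed and undirected confusion graphs.

    wrong_answer_log: iterable of (asked_item, answered_item) pairs,
    containing wrong answers only.
    """
    counts = defaultdict(Counter)
    for asked, answered in wrong_answer_log:
        counts[asked][answered] += 1

    # Directed weight (u, v): share of answer v among wrong answers on item u.
    directed = {}
    for u, answers in counts.items():
        total = sum(answers.values())
        for v, n in answers.items():
            directed[(u, v)] = n / total

    # Undirected weight: average of the two directed weights
    # (assumption: a missing direction counts as 0).
    undirected = {}
    for (u, v), w in directed.items():
        pair = frozenset((u, v))
        if pair not in undirected:
            undirected[pair] = (w + directed.get((v, u), 0.0)) / 2
    return directed, undirected


# Hypothetical log of wrong answers.
log = [("Bulgaria", "Romania"), ("Bulgaria", "Romania"),
       ("Bulgaria", "Serbia"), ("Bulgaria", "Serbia"),
       ("Romania", "Bulgaria")]
directed, undirected = confusion_graph(log)
print(undirected[frozenset(("Bulgaria", "Romania"))])
```

In this example the directed weights for Bulgaria and Romania are 0.5 and 1.0, so the undirected edge between them has weight 0.75.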


Figure 2 shows the confusion graph for European countries. 
The confusion graph contains distinct clusters of items; this
observation holds also for confusion graphs of other practice
contexts in the used systems. To automatically detect
these clusters we use a community detection algorithm [3].
The resulting clusters are meaningful and can provide useful
insight for teachers and developers of educational systems
(see Figure 2 for an illustration). The presented clustering was
obtained by an off-the-shelf implementation of the community
detection algorithm [2] without any tuning. For a specific
application of such clustering it may be useful to experiment 
with different community detection algorithms and specific 
definitions of the confusion graph. 
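
To make the clustering step concrete without external dependencies, the sketch below uses weighted label propagation. This is a simpler heuristic that we substitute purely for illustration; it is not the community detection algorithm cited above, and in practice an established library implementation would be preferable. The edge format, the deterministic node order, the tie-breaking rule, and the example weights are all our assumptions.

```python
from collections import defaultdict


def label_propagation(edges, max_rounds=100):
    """Cluster nodes of a weighted undirected graph by label propagation.

    edges: dict mapping a pair of nodes (tuple or frozenset) -> weight.
    Returns a dict mapping node -> community label.
    """
    neighbours = defaultdict(dict)
    for pair, w in edges.items():
        u, v = tuple(pair)
        neighbours[u][v] = w
        neighbours[v][u] = w

    labels = {node: node for node in neighbours}  # each node starts alone
    order = sorted(neighbours)  # fixed order keeps the result deterministic
    for _ in range(max_rounds):
        changed = False
        for node in order:
            score = defaultdict(float)
            for other, w in neighbours[node].items():
                score[labels[other]] += w
            # Ties are broken in favour of the alphabetically smallest label.
            best = max(sorted(score), key=lambda lab: score[lab])
            # Switch only if the new label is strictly better supported.
            if score[best] > score.get(labels[node], 0.0):
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical undirected confusion weights.
edges = {
    ("Finland", "Norway"): 0.4, ("Finland", "Sweden"): 0.4,
    ("Norway", "Sweden"): 0.5, ("Estonia", "Latvia"): 0.5,
    ("Estonia", "Lithuania"): 0.3, ("Latvia", "Lithuania"): 0.5,
    ("Estonia", "Finland"): 0.05,
}
labels = label_propagation(edges)
print(labels)
```

On this small example the weak edge between Estonia and Finland is not enough to merge the two groups, so the three Nordic and the three Baltic countries end up in separate communities.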


3.5 Other Properties of Wrong Answers 

Wrong answers may help us to (quickly) differentiate between
different groups of users. For example, in the geography domain
we can see some important differences in wrong answers
of students of different geographical origin, e.g., confusion
between Slovakia and Slovenia, which is a much more
common mistake for US students than for Czech students,
or wrong answers for Belarus (Bulgaria for US students, 
Ukraine for Czech students). 


Wrong answers differ in their “persistence”, i.e., the probability
that the mistake will be repeated (by the same student) in
the future. For example, consider wrong answers for Ireland.
The United Kingdom is a more probable mistake than Italy, but
the latter is more likely to persist. Other similar examples
are Moldova (answers Macedonia versus Kosovo) or
Benin (answers Burundi versus Ghana). Some mistakes are 
very likely to be repeated, e.g., confusion between Zambia 
and Zimbabwe, Gambia and Senegal, or Guinea-Bissau and 
Burkina Faso. 
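
One possible way to quantify persistence is sketched below. As an assumption on our side, persistence of a wrong answer is operationalized as the share of cases in which the same student gives the same wrong answer at the next attempt on the same item; the text itself only speaks of repetition "in the future". The log format, the function name `persistence`, and the example data are hypothetical.

```python
from collections import defaultdict


def persistence(attempts):
    """Estimate how often a wrong answer is repeated at the next attempt.

    attempts: chronologically ordered (student, asked_item, answered_item)
    tuples; an answer is wrong when answered_item differs from asked_item.
    Returns a dict mapping (asked_item, wrong_answer) -> repetition share.
    """
    last_answer = {}
    followed = defaultdict(int)  # wrong answers that have a later attempt
    repeated = defaultdict(int)  # ... where the same wrong answer recurred
    for student, asked, answered in attempts:
        key = (student, asked)
        if key in last_answer:
            previous = last_answer[key]
            if previous != asked:  # the previous attempt was wrong
                followed[(asked, previous)] += 1
                if answered == previous:
                    repeated[(asked, previous)] += 1
        last_answer[key] = answered
    return {k: repeated[k] / n for k, n in followed.items()}


# Hypothetical answer log for the item "Ireland".
attempts = [
    ("s1", "Ireland", "United Kingdom"),
    ("s2", "Ireland", "United Kingdom"),
    ("s3", "Ireland", "Italy"),
    ("s1", "Ireland", "Ireland"),
    ("s2", "Ireland", "United Kingdom"),
    ("s3", "Ireland", "Italy"),
]
result = persistence(attempts)
print(result)
```

In this toy log the more frequent mistake (United Kingdom) is repeated in half of the cases, while the rarer one (Italy) is repeated every time, mirroring the kind of contrast described above.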


4. CONCLUSIONS 


Our analysis suggests that wrong answers are an underused
resource in online educational systems. They are easy to collect
and can provide interesting insight applicable in many
different ways (student modeling, automatic question and 
hint construction, feedback and inspiration for teachers and 
system developers). We provide a systematic overview of po- 
tential applications of wrong answers and many illustrative 
examples of interesting insights from educational applica- 
tions. 


We also propose specific novel approaches to analysis and 
utilization of wrong answers, particularly a classification of 
wrong answers into four categories (which can be treated 
as “degrees of wrongness”) and clustering of items using a 
confusion graph (based on wrong answers) and a community 
detection algorithm. The results of analysis from three dif- 
ferent domains (geography, anatomy, basic arithmetic) show 
that properties of wrong answers are rather consistent and 
thus the developed approaches should be applicable also for 
other domains. 


Figure 1: The first row shows the frequency of different categories of wrong answers for different systems and for
selected maps in the geography system. The rest of the figure shows properties of different categories of answers
normalized with respect to correct answers.


Figure 2: Left: A confusion graph for European countries (showing only the most significant edges). Right:
Clustering of European countries based on community detection in the confusion graph.